Genome Research — Latest Matching Preprints

1

CuGen: A GPU-accelerated framework for large-scale genomics

Kiiskinen, T.; Richland, J.; Wang, W.; Lu, W. S.; Balasubramanian, N.; Hastie, T.; Tibshirani, R.; Rivas, M. A.

2026-07-17 genetic and genomic medicine 10.64898/2026.07.15.26358178 medRxiv

Top 0.9%

6.2%

Biobank-scale genomic analyses remain computationally expensive, CPU-bound workflows, particularly when adjusting for confounding. Here, we present CuGen, a GPU-accelerated framework for large-scale genomics. CuGen uses UltraLasso, a novel hierarchical application of univariate-guided sparse regression (uniLasso), to select a compact, phenotype-informed active set of fewer than 30,000 variants. This achieves robust leave-one-chromosome-out (LOCO) confounding control, enabling both downstream GWAS and in-sample fine-mapping. Additionally, we introduce the .cugen file format, a genotype representation designed for memory-optimized, high-throughput streaming and random access on GPU hardware. Building on this substrate, we provide a general GPU-accelerated genomics toolkit handling polygenic prediction, data manipulation, quality control, analysis, and visualization. We demonstrate CuGen's efficacy in the UK Biobank with up to 408,624 individuals, where the full GWAS pipeline and fine-mapping against 6.8 million imputed variants completes in approximately 10 minutes on a single high-throughput GPU with 80 GB of memory. The pipeline scales efficiently to massive phenome-wide analyses with sublinear resource consumption.

2

Transcription controls chromatin-nuclear lamina contacts through distinct Lamin A and LBR tethering mechanisms

Bernasconi, M.; Breda, J.; van Schaik, T.; Manjon, A.; Zambelli, F.; Pavesi, G.; Medema, R. H.; Muzi-Falconi, M.; van Steensel, B.; Manzo, S.

2026-07-15 genomics 10.64898/2026.07.14.738400 medRxiv

Top 3%

2.6%

Lamina-associated domains (LADs) are large genomic regions that interact with the nuclear lamina (NL). Much of the underlying "grammar" governing their positioning at the nuclear periphery remains unclear. LADs are composed of heterochromatin and typically harbor repressed genes, and their association with the NL is generally incompatible with strong transcriptional activity. The extent to which transcription globally shapes chromatin-NL interactions is not fully understood. Here, we combined acute transcription inhibition using Flavopiridol or Triptolide with genome-wide mapping of chromatin-NL contacts. We found that chromatin-NL interactions are rapidly rewired upon transcription inhibition. Changes in chromatin-NL contacts upon transcription shutdown are predictable based on transcriptional activity and the presence of H3K9me3-marked heterochromatin. This rewiring is reversible, as genome-NL interactions quickly return to baseline levels following drug wash-off. Notably, gain and loss of chromatin-NL interactions upon transcription shutdown reflect two distinct tethering mechanisms. Inter-LADs genomic regions (iLADs) enriched in highly active genes and located near stable LADs, which are tethered by Lamin A (LMNA/C), become re-attached to the NL following transcription inhibition. In parallel, H3K9-methylated regions tethered to the nuclear envelope by the Lamin B receptor (LBR) undergo extensive detachment from the NL. Strikingly, LMNA/C and LBR oppositely regulate transcription-sensitive LADs and are required for transcriptional control of chromatin-NL contacts. Together, our findings highlight the plasticity and dynamic nature of chromatin-NL interactions and provide the first evidence that LMNA/C- and LBR-mediated tethering mechanisms exhibit distinct sensitivities to transcription inhibition. 
Highlights: Transcription inhibition alters chromatin-NL contacts rapidly and reversibly. Active transcription prevents inter-LADs located near LMNA/C-tethered LADs from associating with the nuclear lamina. LBR-tethered heterochromatin is repositioned away from the NL. Transcription-dependent modulation of chromatin-NL contacts is dependent on LMNA/C and partially on LBR.

3

Time-dependent transcriptomic changes following protoplast isolation in plants

Zhang, H.; Sangra, A.; Giabardo, A.; Wood, J. C.; Brose, J.; Cloud, S. S.; Hamilton, J. P.; Mailloux, K.; Vaillancourt, B.; Buell, C. R.; Schmitz, R. J.

2026-07-15 plant biology 10.64898/2026.07.14.738454 medRxiv

Top 3%

2.1%

Protoplast isolation is widely used for plant functional genomics and single-cell analyses, but its impact on transcriptional and cell state dynamics remains incompletely understood. Here, we generated time-course RNA-seq data from leaf protoplasts of Arabidopsis, maize, and poplar, sampling at multiple time points following isolation, to systematically characterize global transcriptional dynamics across species. We identified two major drivers of transcriptional variation: a persistent protoplast isolation effect and a progressive time-dependent transcriptional program, which can be divided into early, middle, and late stages corresponding to an immediate stress response, metabolic and chromatin regulation dynamics, and sustained metabolic and proteostasis regulation, together with species-specific differences across stages. We observed a rapid loss of cell-type-specific transcriptional signatures within 6 hours in Arabidopsis and maize, whereas poplar showed a slower decline. Single-nucleus RNA-seq at 6 hours in maize confirmed attenuation of cell-type-specific transcriptional structure. Furthermore, leveraging this time-course dataset enables the identification of aberrant cell states in single-cell RNA-seq data, exemplified by clusters showing elevated activity of protoplast isolation-associated, middle-, and late-stage transcriptional programs characteristic of stress-like states. Together, our results provide a cross-species framework for dissecting protoplast-induced transcriptional and cell state dynamics and facilitate the systematic identification of stress-associated cell states in single-cell transcriptomic data.

4

Privacy-Preserving Matching for Federated Causal Inference in Multicentre Patient Cohorts

Gusinow, R.; Morgan, A. S.; Canziani, L. M.; Zeitlin, J.; Kim, M.; Gentilotti, E.; Ghosn, J.; Florence, A.-M.; Tami, A.; Toschi, A.; Palacios-Baena, Z. R.; Tacconelli, E.; Hasenauer, J.

2026-07-19 epidemiology 10.64898/2026.07.16.26358171 medRxiv

Top 4%

1.4%

Causal effect estimates can often be biased in clinical and epidemiological studies as patient cohorts frequently exhibit substantial covariate imbalances between treated and control groups, often amplified in multicentre studies due to heterogeneous recruitment, clinical practice, and case mix. Covariate balancing methods are therefore essential for valid causal inference. However, their application becomes challenging when data are distributed across cohorts and cannot be pooled because of privacy, legal, or institutional constraints, leaving a gap in practical methods for causal effect estimation in federated and imbalanced clinical data settings. We develop a privacy-preserving framework for covariate balancing and causal effect estimation across distributed data providers, combining federated aggregation with differential privacy to enable propensity score subclassification and matching without sharing individual-level records. Matching relies on non-disclosive quantities and differentially private distance evaluation, and the resulting matched subsets remain local to each server. Balance can be assessed through federated diagnostics and privacy-preserving visualisations, and we provide secure estimators for average treatment effects with associated uncertainty quantification. We implement this framework in the DataSHIELD federated analysis platform via two R packages. In simulations, we demonstrate agreement between federated and centralised analyses in the absence of privacy noise and quantify the bias-variance trade-offs induced by differential privacy. We illustrate applicability in two multinational settings, a Long COVID cohort and very preterm birth cohorts, showing that the approach enables practical causal analyses under real-world data protection constraints. The DataSHIELD packages are available on GitHub. Additional methodological details are provided in the Supplementary Material.

5

Axolotl regeneration reveals a dormant cis-regulatory grammar conserved across vertebrate genomes

Fujiwara, T.; Nakanishi, K.; Suzuki, T.; Shimizu, H.

2026-07-15 systems biology 10.64898/2026.07.13.738357 medRxiv

Top 5%

1.1%

Unlike most mammals, which lack the capacity to regenerate complex tissues following injury, other vertebrates such as the axolotl rebuild complete limbs throughout life, yet the regulatory mechanisms underlying this striking difference have remained elusive. Here, we define the core cis-regulatory motif grammar driving axolotl limb regeneration and demonstrate that this grammar is conserved within the syntenic neighborhoods of regeneration-gene orthologs in human and mouse genomes, despite being epigenetically sealed in adult mammalian tissues. Integrating this cross-species grammar projection with AlphaGenome, a multimodal genomic AI capable of predicting epigenomic states from long sequence context, we find that the highest-ranking candidate loci are predicted to occupy a state of bivalent dormancy marked by the co-enrichment of poising and repressive histone modifications alongside suppression of transcriptional activity and chromatin accessibility. Systematic in silico motif perturbation further predicts that this dormant state is actively enforced by specific dormancy-stabilizing sequence elements, and that disrupting these elements shifts candidate loci toward a regeneration-competent chromatin configuration. Our findings support a model in which the regenerative blueprint has not been erased from the mammalian genome but locked within it, opening new avenues for understanding the evolution of regenerative competence and for the rational reactivation of latent regenerative programs.

6

Characterizing the impact of plasma protein levels on human brain structure and disorders leveraging integrative multi-omics analysis

Ayubcha, C.; Dennis, E.; Bhattacharyya, U.; John, J.; Lam, M.; Lencz, T.; Ge, T.; Chen, C.-Y.

2026-07-15 genetic and genomic medicine 10.64898/2026.07.13.26358006 medRxiv

Top 5%

1.0%

With recent advances in high-throughput proteomic technologies, population-scale plasma proteomics datasets, often linked to extensive genetic and phenotypic information, have become increasingly accessible. Yet the relationships between circulating protein levels, brain imaging phenotypes, and risk for neurological and psychiatric disorders remain largely unexplored. Proteome-wide association studies offer a promising approach for elucidating biological mechanisms that connect genetic variation to complex brain-related traits and diseases. In this study, we integrated protein quantitative trait loci (pQTLs) from the two largest plasma proteomic resources (the UK Biobank Pharma Proteomics Project [UKB-PPP] and Ferkingstad et al. [deCODE]) with genome-wide association studies of brain imaging-derived phenotypes in UK Biobank using Mendelian randomization and colocalization analyses. We identified 120 cis and 20 trans associations between plasma proteins and imaging phenotypes and validated these findings using brain tissue-derived proteomic and transcriptomic datasets. Multivariable Mendelian randomization revealed eleven plasma proteins (coding genes APOE, ARL3, MICB, NSF, RHOC, RSPO3, ENPP2, BTN2A1, EIF2AK3, MRVI1, and OPLAH) with significant direct effects on the risk of Alzheimer's disease, Parkinson's disease, multiple sclerosis, bipolar disorder, and schizophrenia. Single-cell expression and pathway enrichment analyses further revealed cell-type-specific effects and distinct biological processes underlying these protein-disease associations. Together, these findings demonstrate robust links between plasma protein variation and brain structure, delineate protein-disease pathways, and highlight the cellular and molecular mechanisms that contribute to neurobiological diversity and pathology.

7

Multi-tissue analyses of allele-specific chromatin accessibility nominate likely functional variants for type 2 diabetes

Narisu, N.; Li, H. X.; Rathbun, C. J. M.; Varshney, A.; Swift, A. J.; Yan, T.; Sinha, N.; Currin, K. W.; Xue, D.; Robertson, C. C.; Taylor, D. L.; Taylor, H. J.; Beck, A.; Lee, B. N.; Wang, L.; Broadaway, K. A.; Wilson, E. P.; Stringham, H.; Saramies, J.; Lakka, T. A.; Spracklen, C. N.; Scott, L. J.; Stitzel, M. L.; Tuomilehto, J.; Laakso, M.; Koistinen, H. A.; Boehnke, M.; Arda, H. E.; Chen, S.; Biesecker, L. G.; Bonnycastle, L. L.; Erdos, M. R.; Mohlke, K. L.; Parker, S. C. J.; Collins, F. S.

2026-07-15 health informatics 10.64898/2026.07.14.26358094 medRxiv

Top 6%

1.0%

Genome-wide association studies (GWAS) have identified >1,200 signals associated with type 2 diabetes (T2D), yet identifying functional variants remains challenging because the majority of them lie in noncoding regions of the genome and are in areas of high linkage disequilibrium (LD). While chromatin accessibility QTL (caQTL) and expression QTL (eQTL) analyses are useful for nominating regulatory mechanisms underlying GWAS signals, limitations still exist in pinpointing functional variants within regions of high LD. A complementary approach that has been less frequently applied is to focus on the allele-specific effect on chromatin accessibility at heterozygous single-nucleotide polymorphisms (SNPs), hereafter referred to as allelic imbalance. We analyzed the allelic imbalance of reads generated from an assay for transposase-accessible chromatin with sequencing (ATAC-seq) across genotyped samples from 490 donors in T2D-relevant tissues: skeletal muscle, liver, pancreatic islets, adipose tissue, and relevant cell types. We identified 119,949 allelically imbalanced SNPs (FDR<0.05) across the genome. The allelic imbalance was often most prominent in one tissue and showed an enrichment overlapping with tissue-specific transcription factor (TF) binding footprints. Focusing on the 8,581 SNPs in previously published 99% credible sets from 338 T2D GWAS signals, we identified 256 imbalanced SNPs across 123 (36.4% of) signals, each showing allelic imbalance in at least one tissue or cell type. Of these, 71 signals contained only a single imbalanced SNP, representing excellent candidate causative variants. As a proof-of-concept, we showed that 23 of the 256 imbalanced SNPs were supported by allelic assays from previous studies. Further, we experimentally validated two imbalanced SNPs as likely functional variants: rs34584161 among a seven-SNP T2D credible set at the RNF6 signal in islets and rs849134 among a 13-SNP credible set at the JAZF1 signal in liver. 
This study demonstrates the power of integrating ATAC-seq allelic imbalance (ASAI) with GWAS statistical fine-mapping to identify candidate functional regulatory variants from among tightly linked GWAS variants in disease-relevant tissues. While applied here in T2D, this approach represents a widely applicable high-throughput framework for refining the genetic architecture of complex traits.
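
As a rough illustration of the allelic imbalance idea described in this abstract, the reference and alternate read counts at one heterozygous SNP can be compared against the 1:1 expectation with a two-sided exact binomial test. This is only a minimal sketch under stated assumptions: the abstract does not specify the authors' statistical model, and the function name, the example counts, and the 0.5 null proportion are hypothetical. It ignores reference mapping bias, overdispersion, and the FDR correction the study applies.

```python
from math import comb


def allelic_imbalance_pvalue(ref_reads: int, alt_reads: int, p_null: float = 0.5) -> float:
    """Two-sided exact binomial p-value for imbalance at one heterozygous SNP.

    Hypothetical helper for illustration; suited to modest read depths
    (very large counts would overflow the integer-to-float conversion).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    # Probability of every possible reference-read count under the null
    pmf = [comb(n, k) * p_null**k * (1 - p_null) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the outcomes that are no more likely than the observed one
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


balanced = allelic_imbalance_pvalue(5, 5)    # hypothetical counts
skewed = allelic_imbalance_pvalue(10, 0)     # hypothetical counts
```

In practice, p-values like these would be computed per SNP and per tissue and then adjusted for multiple testing before calling a SNP imbalanced.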

8

PRANA: A Deep Learning Method for Adapting Polygenic Risk Scores to Diverse Ethnic Groups

Levi, H.; The Breast Cancer Association Consortium; Michailidou, K.; Elkon, R.; Shamir, R.

2026-07-15 genetic and genomic medicine 10.64898/2026.07.12.26357860 medRxiv

Top 6%

0.8%

Polygenic risk scores (PRSs), which quantify inherited susceptibility to complex traits and diseases, have emerged as valuable tools for risk stratification and precision medicine. Despite their promise, PRS developed on European cohorts often demonstrate substantially reduced predictive accuracy in non-European populations, due to differences in genetic architecture. The disproportionate representation of European ancestry cohorts in genome-wide association studies (GWAS) leads to inequitable deployment of PRS technologies across diverse populations. Here, we introduce PRANA (Polygenic Risk Adaptation via Neural-network Architecture), a deep learning framework that adapts an existing PRS developed on one population to other ancestries. Unlike methods that require large-scale GWAS in the target population, PRANA leverages pre-trained PRS models derived from European cohorts and adapts them using modestly sized cohorts from the target population. We evaluated PRANA on seven complex traits in South Asian, East Asian and Ashkenazi Jewish populations, as well as in selected smaller East Asian subpopulations where the scarcity of training data poses a particular challenge. PRANA mostly improved predictive performance of the baseline PRS models by 5%-20% in terms of effect size and Nagelkerke's R^2, and, in most cases, outperformed existing cross-ancestry multi-PRS approaches. These results highlight PRANA as a scalable and practical strategy to reduce disparities in genomic risk prediction and advance the equitable application of PRS in diverse populations.

9

Empirical estimation of multiple-testing burden for population-based HLA association studies using sequencing-derived HLA alleles across genetic ancestries

Taliun, D.; Gagliano Taliun, S. A.

2026-07-15 genetics 10.64898/2026.07.12.738059 medRxiv

Top 6%

0.8%

As population-scale whole-genome sequencing datasets continue to expand, they enable genetic association studies beyond single-nucleotide variants to more complex forms of genetic variation, including classical human leukocyte antigen (HLA) alleles. The HLA region comprises nine highly polymorphic classical HLA genes in extensive linkage disequilibrium that are associated with numerous autoimmune and infectious diseases. However, unlike genome-wide association studies of single-nucleotide variants, there is no general guidance for controlling the multiple-testing burden in HLA allele association analyses. Here, we systematically evaluated the effective number of independent HLA allele tests using sequencing data from diverse genetic ancestries, analytical derivation and simulations. We show that the multiple-testing burden depends on genetic ancestry, allele frequency, and the phenotype model, but remains remarkably stable across minor allele count thresholds, corresponding to approximately 60-70% of the total number of tested HLA alleles. Simulations further demonstrate that the effective number of tests can exceed 90% under realistic disease models. Analyses of 4-field HLA alleles from long-read sequencing showed that higher typing resolution increases the number of alleles but preserves the underlying correlation structure and scales the effective number of independent tests proportionally. Our results provide practical guidance for HLA association studies and support Bonferroni correction based on the total number of tested HLA alleles as a simple and robust approximation when permutation-based approaches are impractical.
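
The arithmetic behind this recommendation can be sketched as follows. The allele count of 200 and the 65% effective fraction are hypothetical values chosen from within the 60-70% range quoted in the abstract; they are not figures from the study.

```python
def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


n_alleles = 200            # hypothetical number of tested HLA alleles
effective_fraction = 0.65  # hypothetical, within the 60-70% range in the abstract

# Threshold based on the total number of tested alleles
total_based = bonferroni_threshold(0.05, n_alleles)
# Threshold based on the effective number of independent tests
effective_based = bonferroni_threshold(0.05, effective_fraction * n_alleles)
```

Under these assumed numbers the two thresholds differ by a factor of about 1.5, which is consistent with the abstract's view that correcting for the total number of alleles is a simple, mildly conservative approximation.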

10

Local ancestry-informed rare variant burden testing improves gene discovery in admixed populations

Kore, P.; Tan, T.; Lu, W.; Manuel-Friedman, A.; Hu, L.; Chatterjee, N.; Zhou, W.; Dhindsa, R. S.; Atkinson, E. G.

2026-07-15 genetic and genomic medicine 10.64898/2026.07.13.26357993 medRxiv

Top 6%

0.8%

Rare-variant association studies enable the discovery of high-impact genetic contributors often missed by conventional genome-wide association studies focused on common variation. However, standard burden tests aggregate variants without accounting for local ancestry in admixed genomes, reducing power when rare variant frequencies or genetic effects differ across ancestral backgrounds. Here, we introduce Tractor-Burden, an ancestry-aware gene-based association method that partitions rare-variant burden by inferred local ancestry and estimates ancestry-specific effects within a unified regression. In simulations, Tractor-Burden is well calibrated and improves power over standard burden tests under effect heterogeneity. Applied to whole-genome sequencing data from 47,152 admixed African-European individuals in the All of Us Research Program, Tractor-Burden recapitulates known associations, including ancestry-enriched effects at LDLR, and identifies additional suggestive genes and pathways for type 2 diabetes. Tractor-Burden extends rare-variant association testing to admixed genomes and provides a scalable framework for detecting and interpreting gene-level effects across local ancestry backgrounds.

11

SF3B3 / SF3B5 form a metazoan specific transcription module of the U2 snRNP that coordinates Pol II elongation in a splicing independent manner

Vassiliadis, D.; Balic, J. J.; Braniff, O.; Gillespie, A.; Rothnie, W.; Prest, K.; Sinclair, O.; Das, A.; Ang, C.-S.; Dawson, M. A.

2026-07-15 molecular biology 10.64898/2026.07.14.737342 medRxiv

Top 8%

0.5%

Co-transcriptional splicing is a conserved feature of eukaryotic gene expression. However, establishing the functional nature of this process has been difficult. Here, using high-throughput CRISPR/Cas9 screens, we surprisingly find that SF3B3, the third largest subunit of the U2 snRNP complex, is a major regulator of RNA Pol II pause release and processivity. Remarkably, the absence of SF3B3 dramatically perturbs transcription, but U2 snRNP assembly and RNA splicing remain unaffected. Mechanistically, SF3B3 coordinates the chromatin occupancy of transcriptional kinases (CDK9/12/13) alongside the PAF1c and Integrator complexes to regulate Pol II. Structure/function analyses of SF3B3 revealed that a metazoan-specific 18aa sequence within its disordered tail phenocopies its loss and mediates the physical association and stability of SF3B5. We show that loss of SF3B5 mirrors SF3B3 deficiency, suggesting this submodule, although resident within the U2 snRNP complex, evolved to primarily coordinate RNA Pol II in a splicing-independent manner.

12

Muscle proteins in plasma associate to distinguished phenotypes in amyotrophic lateral sclerosis

Azizi, L.; Aksoylu, I.; Bueno Alvez, M.; Foucher, J.; Juto, A.; Seitz, C.; Press, R.; Samuelsson, K.; Kläppe, U.; Uhlen, M.; Edfors, F.; Bergström, S.; Fang, F.; Nilsson, P.; Öijerstedt, L.; Manberg, A.; Ingre, C.

2026-07-16 neurology 10.64898/2026.07.14.26357727 medRxiv

Top 8%

0.5%

Background: Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease characterized by death of upper and lower motor neurons, usually presenting with clinical heterogeneity. Fluid biomarker development remains dominated by neurofilament light chain (NEFL), a marker of neuroaxonal injury. NEFL is, however, not specific to ALS or its phenotypes, and there is currently a lack of biomarkers that capture ALS heterogeneity such as onset site and ALS-frontotemporal spectrum disorder (ALS-FTSD). Therefore, we investigated whether plasma proteomics could reveal pathway-level signatures that stratify and explain ALS heterogeneity. Methods: We profiled ~5,400 plasma proteins (Olink Explore HT) in 299 patients with ALS and 50 age- and sex-comparable healthy controls. We used two complementary analytic frameworks: (i) differential protein abundance analysis to identify altered proteins in ALS and across clinical subgroups, and (ii) weighted gene correlation network analysis (WGCNA) to identify coordinated protein modules and relate them to ALS diagnosis and to ALS-specific clinical traits (site of onset, ALS-FTSD, ALS functional rating scale-revised (ALSFRS-R) score, and plasma NEFL). Results: Differential abundance analysis identified 56 proteins altered in ALS versus controls, of which 40 were increased. WGCNA identified 11 co-expression modules, with ALS samples having the strongest correlation to a protein module (n=51) highly enriched for muscle-related proteins. Out of the 40 proteins that had increased expression levels, 29 overlapped with the muscle-enriched protein module, indicating that muscle-related proteins are the dominant circulating proteomic signature in ALS. This signal extended to clinical stratification: spinal-onset patients showed a strong positive association with the muscle module.
Further, differential abundance analysis of spinal- versus bulbar-onset ALS identified changes that mapped predominantly to the same module, supporting a molecular signature of onset phenotype. In contrast, cognitive status (ALS-FTSD) mapped to distinct modules enriched for extracellular matrix/cell-adhesion pathways, consistent with a separable biological axis of disease heterogeneity. Although multiple modules correlated with NEFL, trait-specific signatures were not fully explained by neuroaxonal injury. Notably, the muscle-enriched module increased with higher NEFL and lower ALSFRS-R, supporting its interpretation as a severity-linked, muscle-involvement proxy. Conclusions: Large-scale plasma proteomics reveals that heterogeneity in ALS reflects underlying biological structures. We identified a dominant muscle-associated protein network that distinguished ALS patients from controls and correlated with disease onset phenotype and severity, alongside distinct protein networks linked to ALS-FTSD. By integrating differential protein abundance with network-based analysis, we defined pathway-level biomarker signatures that extend beyond NEFL, enabling biologically informed patient stratification and improved therapeutic monitoring.

13

Epithelial Stem Cell Fate Determines Chemoradiotherapy Response in Rectal Cancer

Li, N.; Ishaqwala, F.; Wright, T. A.; Wilkinson, A.; Vlckova, P.; Trevers, K.; O'Sullivan, R.; Crampsie, S.; Basiarz, E.; Vanderkamp, S.; McCulloch, A. K.; Dobric, A.; Krishnaswamy, S.; Vanhaesebroeck, B.; Glasgow Serial Sampling Consortium; Roxburgh, C. S. D.; Hawkins, M.; Tape, C. J.

2026-07-15 cancer biology 10.64898/2026.07.15.736775 medRxiv

Top 8%

0.5%

Rectal cancers are often treated with neoadjuvant chemoradiotherapy (CRT), yet 85% of patients do not achieve a pathological complete response. To identify the molecular determinants of CRT response, we profiled the single-cell signalling, DNA-damage, cell-cycle, apoptotic, and cell-fate responses of 2,769 patient-derived organoid cultures treated with CRT, cancer-associated fibroblasts (CAFs), and signal-rewiring agents. We find that CRT response is determined by stem cell-fate. CRT triggers comparable DNA-damage in isogenic proliferative (proCSC) and revival (revCSC) colonic stem cells, but proCSC retain damage and die whereas revCSC resolve damage and persist. Both CRT and CAFs drive proCSC to a common treatment-resistant revCSC fate and high revCSC predicts worse survival in patients. Pharmacologically constraining stem-cell plasticity increases CRT sensitivity, and Spatial Perturbation of ARrayed Tumour Assembloids (SPARTA) confirms YAP/TEAD inhibition improves chemotherapy responses in human stromal-tumour models. These results suggest that cancer cell-fate, not genotoxic damage itself, ultimately governs response to standard-of-care chemoradiotherapy.
Highlights: Rectal cancer stem cell-fate determines chemoradiotherapy-induced apoptosis. proCSCs retain DNA-damage and die, whereas revCSCs repair damage and persist. CAFs and chemoradiotherapy converge on a common chemo-radioresistant revCSC state. SPARTA reveals TEAD inhibition blocks DNA-repair persisters in stromal assembloids.

14

Rapid plastid isolation reveals the chloroplast proteome and structures of the chlororibosome large subunit and RuBisCO in Marchantia polymorpha

Raval, P. K.; Mitchell, C.; Lozano-Quiles, M.; O'Keefe, S.; Nyman, T. A.; Battersby, B.; Butcher, S. J.; Gould, S. B.

2026-07-15 plant biology 10.64898/2026.07.14.738477 medRxiv

Top 9%

0.4%

Plastids house the biology of eukaryotic photosynthesis. The majority of a plastid's proteome is imported after cytosolic translation, but a few dozen proteins on average remain organelle-encoded, translated by the plastid's own ribosomes. While thousands of plastid genomes have been sequenced, the availability of fewer than ten proteomes and only two species with full 70S plastid ribosomal structures limits our understanding of land plant evolution. To address this, we optimized a protocol for the rapid isolation of Marchantia polymorpha chloroplasts that provides a highly enriched and intact organelle fraction from gradient volumes as little as 2 mL. Our approach was successfully applied to six other species, including Chlamydomonas reinhardtii and Nicotiana tabacum. Focusing on M. polymorpha, we determined the proteome of the chloroplast fraction, identifying 1337 nuclear-encoded proteins with high confidence, where 83% belong to orthologs shared with angiosperms. We further isolated large protein complexes by RNA affinity purification using poly-lysine and provide the high-resolution structures of the 50S subunit of the chloroplast ribosome and RuBisCO from this bryophyte using cryogenic EM and image reconstruction to 2.23 and 2.12 Å resolution, respectively, highlighting the structural conservation of both complexes. For chloroplasts, our data show that the genome reduction event experienced by the common ancestor of bryophytes has had little impact on the organelle's complexity, and they underscore a high level of structural conservation of core components of plastid biology. Our data provide novel resources and methods to explore the functional evolution of plastid proteomes and major macromolecular complexes of cyanobacterial origin.

15

Reliability-weighted target prioritization in CD4+ T-cell Perturb-seq: a generalizability-theory decomposition

Cheng, C.

2026-07-15 bioinformatics 10.64898/2026.07.13.738312 medRxiv

Top 9%

0.4%

Genome-scale Perturb-seq screens prioritize candidate targets by the strength of a perturbation's transcriptional effect. Effect strength does not answer a prior measurement question: is the readout dependable? A large effect estimated from a single guide, a single donor, or a pseudobulk of few cells need not survive replication, and for target prioritization each false lead costs a validation experiment. We treat each perturbation effect as a measurement in a crossed Target x Guide x Donor x Condition design and apply generalizability theory (Brennan, 2001; Cronbach et al., 1972) to separate the dependable part of an effect from facet-specific idiosyncrasy. Guides and donors enter as random facets; condition enters as a fixed facet and is analyzed within its levels. For each target we report a dependability profile over the facets and a joint generalizability coefficient over the two random facets, and we re-rank targets by effect magnitude weighted by that coefficient. On the released screen (Zhu et al., 2025), removing the measurement-error floor estimated from the non-targeting controls raises the number of genes with a dependable target-signal share above .10 from 40 to 7,674. Analyzed within activation states, dependability recovers the T-cell-receptor signaling module as reliably measurable only in activated cells, without recourse to gene annotation. A design study indicates that reliability is limited by the number of guides rather than the number of donors, so a future screen should add guides. Every methodological decision was recorded and adversarially reviewed, and all results regenerate from the released summary statistics.
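
The abstract does not give the formula it uses, so the following is only a sketch of the textbook relative generalizability coefficient for a design with two crossed random facets (guides and donors), together with one plausible reading of "effect magnitude weighted by that coefficient" as a simple product. The function names, variance components, and target labels are hypothetical.

```python
def g_coefficient(var_target, var_tg, var_td, var_resid, n_guides, n_donors):
    """Relative G coefficient for targets crossed with guides and donors.

    var_target: variance attributable to targets (the dependable signal)
    var_tg, var_td: target-by-guide and target-by-donor interaction variances
    var_resid: residual (three-way interaction plus error) variance
    """
    # Relative error variance shrinks as more guides or donors are averaged
    error = var_tg / n_guides + var_td / n_donors + var_resid / (n_guides * n_donors)
    total = var_target + error
    return var_target / total if total > 0 else 0.0


def reliability_weighted_rank(effects, coefficients):
    """Rank targets by |effect| multiplied by their G coefficient (assumed weighting)."""
    scores = {t: abs(effects[t]) * coefficients[t] for t in effects}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical targets: A has the larger effect but is measured less dependably
ranking = reliability_weighted_rank({"A": 2.0, "B": 1.5}, {"A": 0.3, "B": 0.8})
```

With these invented numbers, target B outranks target A despite its smaller raw effect. Varying `n_guides` and `n_donors` in `g_coefficient` is the kind of design-study calculation the abstract alludes to.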

16

Hierarchical Gene Cluster Regulation Across Vertebrate Skins: Developmental Control of Keratin Gene Expression

Jea, W.-C.; Wu, P.; Chen, C.-K.; Chuong, C.-M.; Liang, Y.-C.

2026-07-15 developmental biology 10.64898/2026.07.14.738566 medRxiv

Top 9%

0.4%

Developmental competence allows tissues to respond to inductive cues before committing to specialized forms, but how this potential is encoded at clustered gene-family loci is poorly understood. We use vertebrate skin to address this problem. Epidermis responds to regional dermal signals before committing to feather, scale, or differentiated programs, and α-keratin loci provide a stringent genomic test: separated type-I/type-II clusters show coordinated transcriptional pairing, yet individual keratin genes are selectively deployed across appendage, differentiation, and disease states. Using chicken developmental genomics with comparative mouse and human epidermal datasets, we show that α-keratin clusters are organized before commitment as scaffolded chromatin domains. Within these domains, regulatory elements remain broadly accessible but acquire state-specific activity during commitment and differentiation. Inter-cluster contacts and chromatin-factor perturbation link this architecture to keratin output and morphology. These findings reveal a locus-level chromatin basis for developmental competence, enabling domain-level coordination with gene-level selectivity during epidermal diversification.

17

From amplicon to antigen: a quantified transmission map that nominates multi-antigen antibody-drug-conjugate co-target sets across cancer types

Lam, J. M.; Walker-Samuel, S.; Pennycuick, A.

2026-07-16 oncology 10.64898/2026.07.13.26357987 medRxiv

Top 10%

0.3%

Somatic copy-number amplification is pervasive in cancer, and the genes it carries are candidate drug targets - but only those whose amplification is transmitted to accessible surface protein can be reached by an antibody-drug conjugate (ADC). We build an integrated map of copy-number-to-protein transmission across six tumour types and ask, for every amplified gene, whether its dosage reaches the surface. Copy number transmits to mRNA (median per-gene r = 0.21) but is attenuated at the protein level in 85% of genes, and the mRNA ranking is largely preserved to protein (rho = 0.70); the ranking is set principally at the chromatin/transcription step - among directly measured regulatory inputs, promoter DNA methylation and tumour chromatin accessibility each explain about an order of magnitude more of the transmission variance than gene structure, and do so complementarily. Critically, transmissibility is a stable, gene-intrinsic property: it is predictable from gene properties alone, with no proteomic input, at a leave-gene-out rank correlation of 0.52 (R² = 0.29); it is not positional (holding out whole chromosome arms changes accuracy by 0.001); and it transfers across lineages (Kendall W = 0.97 across leave-one-lineage-out refits). This licenses a predictor that nominates surface targets in cancer types that lack a tissue-referenced proteome, combining direct protein measurement where it is available with prediction where it is not. 
Requiring co-elevation on a recurrent amplicon with measured transmissibility and an accessible extracellular ectodomain nominates 22 surface antigens on 18 distinct recurrent amplicons across four cancer types (renal, endometrial and both lung subtypes) - for example ITGB8+TSPAN13+TTYH3 on lung 7p, NCSTN+HSD17B7+MPZL1 on 1q (recurrent in several types), the transferrin receptor TFRC on squamous 3q, and FZD1 on clear-cell renal 7q; 21 of the 22 are non-driver passengers and 10 are confirmed on the experimental Cell Surface Protein Atlas. In single malignant cells, against a null that controls for per-cell sequencing depth, the co-detected constructs sit at a modest 1.05-1.45x above independence (p < 0.001, donor-block bootstrap intervals clear of 1.0), and at binding-relevant thresholds the normal-tissue co-expression collapses - so an avidity AND-gate that binds stably only where the antigens co-occur would spare normal cells that carry only one. Observed transmissibility itself transfers strongly between the two lung subtypes (rho = 0.88) and remains positive across distant lineages, consistent with the shared cell-of-origin regulation the map implies. Single-cell co-detection is demonstrated wherever a malignant single-cell atlas exists (both lung subtypes and glioblastoma - the latter entirely from prediction, using no GBM surface-abundance measurement); the remaining cohorts are nominated on the same genetic and topological evidence. The result is a pan-cancer, confidence-tiered catalogue of multi-antigen ADC co-target sets with a concrete plan to test them.

18

NFIX missense variants that disrupt the β-hairpin loop result in a severe form of Malan syndrome in adolescence with rapidly evolving scoliosis and muscle wasting

Delagrammatikas, C. G.; Gourlay, L. J.; Priolo, M.; Russo, R.; Ahmadi, A.; Barbiroli, A. G.; Capelli, R.; Stowers, K.; D'Annibale, O.; Ravalin, M.; Tartaglia, M.; Nardini, M.; Cocanougher, B. T.

2026-07-19 genetic and genomic medicine 10.64898/2026.07.16.26357549 medRxiv

Top 10%

0.3%

Purpose: Pathogenic variants in NFIX cause Marshall-Smith syndrome and Malan syndrome (MALNS). We identified a severe subtype of MALNS characterized by adolescent-onset musculoskeletal deterioration and investigated functional consequences of underlying variants. Methods: Clinical data were collected from seven individuals with pathogenic NFIX variants. Wild-type and mutated recombinant NFIX DNA-binding domains (DBDs) were evaluated using biochemical, structural, and DNA-binding assays. Results: Six individuals carrying R116W, R116P, K125E, or G147E NFIX substitutions developed progressive muscle wasting, markedly reduced body mass index, and rapidly progressive scoliosis after the typical childhood features of MALNS; two died from disease-related complications. A seventh individual with R116G did not develop this severe phenotype. Functional studies on recombinant NFIX DBDs showed complete or near-complete loss of DNA-binding activity for R116W, R116P, K125E, and G147E despite preserved protein folding, consistent with disrupted DNA recognition and a potential dominant-negative mechanism. In contrast, R116G exhibited a 7.7 °C decrease in thermal stability, which may support haploinsufficiency mediated by protein degradation. Conclusion: Specific NFIX missense variants define a severe subtype of MALNS associated with progressive musculoskeletal deterioration. In vitro functional studies support variant-specific disruption of DNA binding, providing a mechanistic basis of genotype-phenotype correlations and informing prognosis, clinical surveillance, and therapy development.

19

ZATT/ZNF451 promotes release of stalled TOP2 cleavage complexes

Leng, X.; Zarantonello, A.; Gadi, S. A.; Kakulidis, E.; Fey, P.; Ingham, A.; Hendiks, I. A.; Minocha, S.; Colding-Christensen, C.; Kristensen, S.; Willaume, S.; Palkova, N.; Gaubitz, C.; Garcia Lopez, A.; Bendix, P. M. M.; Sorensen, C. S.; Lund Nielsen, M.; Davey, N. E.; Mailand, N.; Miller, T.; Duxin, J. P.

2026-07-15 molecular biology 10.64898/2026.07.14.738426 medRxiv

Top 11%

0.3%

Topoisomerase II (TOP2) resolves DNA topological constraints through a tightly regulated cycle of DNA double-strand cleavage and religation. Nearby DNA damage or chemotherapeutic agents such as etoposide block the DNA religation step, stabilizing TOP2-DNA cleavage complexes (TOP2ccs) at DNA double-strand breaks (DSBs). The SUMO E3 ligase ZATT (ZNF451) has recently emerged as a key effector of TOP2cc repair, but its mechanism of action remains poorly understood. Here, we show that ZATT is sufficient to resolve TOP2ccs independently of TDP2, TOP2 proteolysis, and canonical DSB repair pathways. Using Xenopus egg extracts and biochemical reconstitution, we find that ZATT salvages trapped TOP2 by promoting TOP2 release from its stalled cleavage complex. Structural modeling and targeted mutagenesis in Xenopus egg extracts and human cells identify a highly conserved hydrophobic pocket in the tower domain of TOP2 where the ZATT coiled-coil "hooks on" to promote TOP2cc resolution. Our findings reveal a new strategy to resolve TOP2ccs that bypasses the exposure of dangerous DNA breaks.

20

Multimodal gene prioritization reveals nonlinear regulatory architecture in childhood-onset asthma

Huang, N.; Ragsac, M. F.; Gui, X.; Tantisira, K. G.; Amariuta, T.

2026-07-16 genetic and genomic medicine 10.64898/2026.07.14.26357983 medRxiv

Top 12%

0.2%

Asthma is a heritable complex disease that disproportionately burdens minority and admixed populations in the US. However, the causal genes and regulatory mechanisms governing inherited risk remain largely unresolved. We performed a European-ancestry meta-analysis of 141,894 cases and 1,361,846 controls drawn from the Trans-national Asthma Genetic Consortium (TAGC) and Global Biobank Meta-analysis Initiative (GBMI), yielding an estimated SNP heritability (h²SNP) of 0.056 (SE = 0.0038) and 275 independently associated loci. To enhance mechanistic inference beyond variant-level associations, we developed a multimodal framework to predict asthma risk integrating GWAS summary statistics, bulk tissue expression quantitative trait loci (eQTL) data from the Genotype-Tissue Expression (GTEx) project, and single-cell gene eQTL data from the OneK1K Project. We performed transcriptome-wide association studies (TWAS) and subsequently applied probabilistic fine-mapping with FOCUS to prioritize putative causal genes expressed in bulk tissues and higher-resolution immune cell populations. Fine-mapping asthma-associated genes implicated barrier-immune and metabolic-endocrine tissues alongside adaptive T-cell subsets as the primary mediators of asthma genetic risk, resolving canonical CD4+ Th2 effector genes including IL1RL1, TSLP, STAT6, and GATA3. Using these prioritized genes, we constructed a polygenic transcriptome risk score (PTRS) using random forest to integrate gene-level effects across critical tissues and cell types. Evaluated in two ancestrally distinct pediatric asthma cohorts, the Childhood Asthma Management Program (CAMP) and the Genetics of Asthma in Costa Rica Study (GACRS), our PTRS demonstrated improved transferability over the standard variant-level and gene-level baseline models. 
While modest common variant heritability limits the discriminative power of our models, we estimated a theoretical maximum achievable area under the receiver operating characteristic curve (AUROC) of 0.64. Our integrative nonlinear model of PRS-CSx and cross-modal (bulk tissue and single cell) FOCUS PTRS resulted in the best cross-cohort performance (CAMP AUC = 0.632, sd = 0.04, 3.55 case/control odds ratio in top vs. bottom quartiles), representing an increase of +0.118 AUC over PRS-CSx, +0.067 AUC over tissue-specific TWAS pruning and thresholding, and +0.041 AUC over cell-type-specific FOCUS PTRS. Our results demonstrate that modeling nonlinear interactions between variant- and gene-level effects across both bulk tissue and single-cell eQTL data improves our ability to identify high-risk individuals and to explain the likely mechanisms driving genetic susceptibility to childhood-onset asthma.